For a long time, plotting large quantities of data in a Python notebook wasn't exactly fun. Classical plotting packages, such as matplotlib, seaborn, plotly and bokeh, were not able to handle millions of points - especially if we talk about interactive visualisation.
However, a few new technologies have recently started gaining attention. In particular, I am talking about JIT (just-in-time compilation) and the corresponding package for Python - Numba. Numba can dramatically speed up standard numerical computations in Python - in fact, traversing a Python loop with Numba is often even faster than using numpy!
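As a quick illustration of what Numba buys you, here is a hypothetical toy function (not from this notebook) - a nested loop that would be painfully slow in pure Python but compiles to machine code with the `@njit` decorator. The fallback makes the sketch runnable even without numba installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function on first call
except ImportError:          # fall back to plain Python if numba is absent
    njit = lambda f: f

@njit
def pairwise_max_diff(a):
    # a plain O(n^2) loop; numba compiles it to machine code,
    # so it runs at near-C speed on large arrays
    best = 0.0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            d = abs(a[i] - a[j])
            if d > best:
                best = d
    return best

a = np.array([3.0, 1.0, 7.0, 2.0])
print(pairwise_max_diff(a))  # → 6.0
```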
On the other hand, there is Dask - a sort of lightweight multiprocessing wrapper around numpy and pandas that helps to work with large (larger-than-RAM) datasets.
Mixing the two, the Bokeh team introduced datashader - an awesome new plotting library that aggregates large datasets into an image or an interactive visualisation. Datashader is super awesome (yet very young, and its API still changes).
In this notebook I will show how to plot a large dataset (36 million points, in this case) on a single machine using two new libraries: dask and datashader.
First, of course, we need to import all the modules we need.
%matplotlib inline
import pylab as plt
from ipynotifyer import notifyOnComplete as nf
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
from dask import dataframe as dd
import dask
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, viridis, inferno
from IPython.core.display import HTML, display
from pyproj import Proj # reproject points to State Plane
nyc = Proj(init='epsg:2263')
def reproj(df, prj=nyc):
    # project lon/lat columns to plane coordinates and store them as x/y
    x, y = prj(df['lon'].values, df['lat'].values)
    df['x'] = x
    df['y'] = y
    return df
Get the data
Now, let's get the dataset loaded.
dsk = dd.read_csv('data/data*.csv', encoding='utf8')
Let's count the rows (note that this triggers an actual pass over the data - dask is lazy by default):
len(dsk) # size of the dataset
Process the data
- lowercase the application column
dsk = dsk.assign(application=dsk.application.str.lower())
- reproject to NYC state plane
dsk = dsk.map_partitions(reproj)
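Conceptually, `map_partitions` just applies the function to each pandas chunk independently and glues the results back together. A pure-pandas sketch of that idea (the `add_xy` function and its dummy "projection" are illustrative stand-ins, not the real `reproj`):

```python
import pandas as pd

def add_xy(df):
    # stand-in for reproj(): derives x/y columns from lon/lat
    df = df.copy()
    df['x'] = df['lon'] * 2  # dummy "projection", for illustration only
    df['y'] = df['lat'] * 2
    return df

df = pd.DataFrame({'lon': [1.0, 2.0, 3.0, 4.0],
                   'lat': [5.0, 6.0, 7.0, 8.0]})

# what dask's map_partitions does, conceptually: split the frame into
# chunks, apply the function to each chunk, then concatenate the results
chunks = [df.iloc[:2], df.iloc[2:]]
result = pd.concat([add_xy(chunk) for chunk in chunks])

print(result['x'].tolist())  # → [2.0, 4.0, 6.0, 8.0]
```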
- add time of day in seconds (timestamp modulo 86400, the number of seconds in a day)
dsk = dsk.assign(daytime=dsk.timestamp.mod(86400))
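A quick sanity check on the modulo trick, in pure Python (assuming unix timestamps in seconds; a day is 86400 seconds):

```python
# 1,600,000,000 seconds after the epoch is 2020-09-13 12:26:40 UTC
t = 1_600_000_000
secs = t % 86400               # seconds since midnight, UTC
h, rem = divmod(secs, 3600)
m, s = divmod(rem, 60)
print(h, m, s)  # → 12 26 40
```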
Now let's play with dask graph visualisation, just because it is awesome. As we can see, the data is split into many "chunks", and a set of transformations is performed on each (all operations are row-wise so far).
dsk.visualize()
And now let's actually compute the result.
d = dsk.compute()
Visualisation
Now let's prepare to visualise our map using datashader.
First, let's define the canvas size:
plot_width = 1000
plot_height = plot_width
background = "black"
The datashader examples propose using the partial helper, so we don't have to define the background style every time:
export = partial(export_image, background = background)
cm = partial(colormap_select, reverse=(background!="black"))
Also, we want the notebook output area to be wide:
display(HTML(""))
Now let's define our data-side canvas coordinates. We can simply reproject them from lon/lat as well:
sw = nyc( -74.15, 40.463661 ) # reproj
ne = nyc( -73.66, 40.947435 ) # reproj
NYC = x_range, y_range = zip(sw, ne)
cvs = ds.Canvas(plot_width, plot_height, *NYC)
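The `zip(sw, ne)` trick above is just a transpose: it turns two (x, y) corner points into an x-range and a y-range. With toy numbers:

```python
sw = (10.0, 40.0)  # (x, y) of the south-west corner, toy values
ne = (20.0, 50.0)  # (x, y) of the north-east corner
x_range, y_range = zip(sw, ne)
print(x_range, y_range)  # → (10.0, 20.0) (40.0, 50.0)
```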
Density
First, let's just count tweets for each pixel.
count = cvs.points(d, 'x', 'y')
Let's start with linear color interpolation. That means the difference in color and/or brightness between two pixels is linearly proportional to the difference of their corresponding values. Most of the time this is a bad idea, as a few hotspots will overwhelm the general population. Still, let's give it a try.
export(tf.interpolate(count, cmap = Greys9, how='linear'),'tweets_density_linear')
As we expected, it is really not helping, so let's switch to histogram equalisation. Equal histogram means that the bucket boundaries are adjusted so that each color in the colormap represents an equal number of points.
export(tf.interpolate(count, cmap = Greys9, how='eq_hist'),'tweets3')
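Conceptually, histogram equalisation amounts to rank-normalising the values. A rough numpy sketch of the idea (not datashader's actual implementation, which also handles ties and binning):

```python
import numpy as np

def eq_hist_sketch(values):
    # map each value to its rank, scaled to [0, 1]: the output is
    # spread uniformly no matter how skewed the input distribution is
    flat = values.ravel()
    ranks = flat.argsort().argsort().astype(float)
    return (ranks / (len(flat) - 1)).reshape(values.shape)

skewed = np.array([1.0, 2.0, 3.0, 1000.0])
print(eq_hist_sketch(skewed))  # the outlier no longer dominates the scale
```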
Now, grey is kind of boring, so let's change the color scheme. It's worth noticing that we don't do any heavy computation here - all the counts were already done in the .points() function. All we are doing now is rendering a 2d matrix.
export(tf.interpolate(count, cmap=viridis, how='eq_hist'), 'colored_total')
Applications
Now, let's determine which of the top 4 applications is the most popular at each point.
I actually started by defining colors. A strange thing to start with, but this way I can use the dict keys to filter apps later.
if background == "black":
    color_key = {'foursquare': 'aqua',
                 'twitter for iphone': 'white',
                 'instagram': 'red',
                 'twitter for android': 'lime'}
else:
    color_key = {'foursquare': 'blue',
                 'twitter for iphone': 'white',
                 'instagram': 'red',
                 'twitter for android': 'lime'}
Filter the data for the top 4 applications, just as with pandas:
appDf = d[d.application.isin(color_key.keys())]
Now, let's convert application to a categorical type:
appDf = appDf.assign(application=appDf.application.astype('category'))
appDf.application.value_counts()
Now count by category
appCount = cvs.points(appDf, 'x', 'y', ds.count_cat('application'))
And plot
export(tf.colorize(appCount, color_key, how='eq_hist'), 'colored_apps')
Daytime
Now, let's visualise the time of day. Here I use the "hsv" colormap, as I want the values for 00:05 and 23:55 to end up close to each other.
Also, I remove noise (points with fewer than 10 tweets), using the count aggregate we already computed.
threshold = 10
aggDaytime = cvs.points(d, 'x', 'y', agg=ds.mean('daytime'))
colormap = plt.get_cmap('hsv')
export(tf.interpolate(aggDaytime.where(count > threshold), cmap=colormap, how='eq_hist'), 'colored_daytime')
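Why a cyclic colormap: on a 24-hour clock the distance between two times wraps around midnight, so 00:05 and 23:55 are only ten minutes apart. A pure-Python check (the helper is illustrative, not part of the notebook's pipeline):

```python
DAY = 86400  # seconds in a day

def circular_diff(t1, t2):
    # shortest distance between two times of day, in seconds,
    # treating midnight as adjacent to 23:59:59
    d = abs(t1 - t2) % DAY
    return min(d, DAY - d)

t_0005 = 5 * 60                 # 00:05 as seconds since midnight
t_2355 = 23 * 3600 + 55 * 60    # 23:55
print(circular_diff(t_0005, t_2355) / 60)  # → 10.0 (minutes)
```

A linear colormap would put these two times at opposite ends of the scale; "hsv" wraps around, so they get nearly the same color.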
Datashader is incredible! Next time I will play with the interactive part of it.
Feel free to ask or suggest anything via casyfill@gmail.com